Update By Query API


Update documents using an ingest pipeline

Update by query can use the Ingest pipelines feature by specifying a pipeline:

PUT _ingest/pipeline/set-foo
{
  "description" : "sets foo",
  "processors" : [ {
      "set" : {
        "field": "foo",
        "value": "bar"
      }
  } ]
}
POST my-index-000001/_update_by_query?pipeline=set-foo

Get the status of update by query operations

You can fetch the status of all running update by query requests with the Task API:

GET _tasks?detailed=true&actions=*byquery

The response looks like:

{ "nodes" : { "r1A2WoRbTwKZ516z6NEs5A" : { "name" : "r1A2WoR", "transport_address" : "127.0.0.1:9300", "host" : "127.0.0.1", "ip" : "127.0.0.1:9300", "attributes" : { "testattr" : "test", "portsfile" : "true" }, "tasks" : { "r1A2WoRbTwKZ516z6NEs5A:36619" : { "node" : "r1A2WoRbTwKZ516z6NEs5A", "id" : 36619, "type" : "transport", "action" : "indices:data/write/update/byquery", "status" : { "total" : 6154, "updated" : 3500, "created" : 0, "deleted" : 0, "batches" : 4, "version_conflicts" : 0, "noops" : 0, "retries": { "bulk": 0, "search": 0 }, "throttled_millis": 0 }, "description" : "" } } } } }

This object contains the actual status. It is just like the response JSON with the important addition of the total field. total is the total number of operations that the update by query expects to perform. You can estimate the progress by adding the updated, created, and deleted fields; the request will finish when their sum is equal to the total field. In the response above, 3500 of 6154 operations have completed, so the request is roughly 57% done.

With the task ID you can look up the task directly. The following example retrieves information about task r1A2WoRbTwKZ516z6NEs5A:36619:

GET /_tasks/r1A2WoRbTwKZ516z6NEs5A:36619

The advantage of this API is that it integrates with wait_for_completion=false to transparently return the status of completed tasks. If the task is completed and wait_for_completion=false was set on it, then it’ll come back with a results or an error field. The cost of this feature is the document that wait_for_completion=false creates at .tasks/task/${taskId}. It is up to you to delete that document.
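For example, a minimal sketch of this workflow (the request body is illustrative, and the cleanup path follows the .tasks/task/${taskId} convention described above, which may vary by Elasticsearch version):

POST my-index-000001/_update_by_query?wait_for_completion=false
{
  "script": {
    "source": "ctx._source['extra'] = 'test'"
  }
}

This returns a task ID such as r1A2WoRbTwKZ516z6NEs5A:36619. Once GET /_tasks/r1A2WoRbTwKZ516z6NEs5A:36619 shows the task has completed, delete the stored result:

DELETE .tasks/task/r1A2WoRbTwKZ516z6NEs5A:36619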

Cancel an update by query operation

Any update by query can be cancelled using the Task Cancel API:

POST _tasks/r1A2WoRbTwKZ516z6NEs5A:36619/_cancel

The task ID can be found using the tasks API.

Cancellation should happen quickly but might take a few seconds. The task status API above will continue to list the update by query task until this task checks that it has been cancelled and terminates itself.

Change throttling for a request

The value of requests_per_second can be changed on a running update by query using the _rethrottle API:

POST _update_by_query/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1

The task ID can be found using the tasks API.

Just like when setting it on the _update_by_query API, requests_per_second can be either -1 to disable throttling or any decimal number like 1.7 or 12 to throttle to that level. Rethrottling that speeds up the query takes effect immediately, but rethrottling that slows down the query will only take effect after completing the current batch. This prevents scroll timeouts.
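For example, to slow the same task down to 500 requests per second (the value 500 is arbitrary, chosen only for illustration):

POST _update_by_query/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=500

Because this slows the query down, the new rate only takes effect after the current batch completes.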

Slice manually

Slice an update by query manually by providing a slice id and total number of slices to each request:

POST my-index-000001/_update_by_query
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "script": {
    "source": "ctx._source['extra'] = 'test'"
  }
}
POST my-index-000001/_update_by_query
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "script": {
    "source": "ctx._source['extra'] = 'test'"
  }
}

Which you can verify works with:

GET _refresh
POST my-index-000001/_search?size=0&q=extra:test&filter_path=hits.total

Which results in a sensible total like this one:

{ "hits": { "total": { "value": 120, "relation": "eq" } } } Use automatic slicingedit

You can also let update by query automatically parallelize using Sliced scroll to slice on _id. Use slices to specify the number of slices to use:

POST my-index-000001/_update_by_query?refresh&slices=5
{
  "script": {
    "source": "ctx._source['extra'] = 'test'"
  }
}

Which you can also verify works with:

POST my-index-000001/_search?size=0&q=extra:test&filter_path=hits.total

Which results in a sensible total like this one:

{ "hits": { "total": { "value": 120, "relation": "eq" } } }

Setting slices to auto will let Elasticsearch choose the number of slices to use. This setting will use one slice per shard, up to a certain limit. If there are multiple source data streams or indices, it will choose the number of slices based on the index or backing index with the smallest number of shards.
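For example, a sketch that mirrors the earlier request but lets Elasticsearch pick the slice count:

POST my-index-000001/_update_by_query?refresh&slices=auto
{
  "script": {
    "source": "ctx._source['extra'] = 'test'"
  }
}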

Adding slices to _update_by_query just automates the manual process used in the section above, creating sub-requests, which means it has some quirks:

- You can see these requests in the Tasks APIs. These sub-requests are "child" tasks of the task for the request with slices.
- Fetching the status of the task for the request with slices only contains the status of completed slices.
- These sub-requests are individually addressable for things like cancellation and rethrottling.
- Rethrottling the request with slices will rethrottle the unfinished sub-requests proportionally.
- Canceling the request with slices will cancel each sub-request.
- Due to the nature of slices, each sub-request won't get a perfectly even portion of the documents. All documents will be addressed, but some slices may be larger than others. Expect larger slices to have a more even distribution.
- Parameters like requests_per_second and max_docs on a request with slices are distributed proportionally to each sub-request. Combine that with the point above about distribution being uneven and you should conclude that using max_docs with slices might not result in exactly max_docs documents being updated.
- Each sub-request gets a slightly different snapshot of the source data stream or index, though these are all taken at approximately the same time.

Pick up a new property

Say you created an index without dynamic mapping, filled it with data, and then added a mapping value to pick up more fields from the data:

PUT test
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "text": {"type": "text"}
    }
  }
}

POST test/_doc?refresh
{
  "text": "words words",
  "flag": "bar"
}
POST test/_doc?refresh
{
  "text": "words words",
  "flag": "foo"
}

PUT test/_mapping
{
  "properties": {
    "text": {"type": "text"},
    "flag": {"type": "text", "analyzer": "keyword"}
  }
}

Because dynamic is set to false, new fields such as flag are not indexed, just stored in _source.

The final PUT test/_mapping request updates the mapping to add the new flag field. To pick up the new field you have to reindex all documents with it.

Searching for the data won’t find anything:

POST test/_search?filter_path=hits.total
{
  "query": {
    "match": {
      "flag": "foo"
    }
  }
}

which returns:

{
  "hits" : {
    "total": {
      "value": 0,
      "relation": "eq"
    }
  }
}

But you can issue an _update_by_query request to pick up the new mapping:

POST test/_update_by_query?refresh&conflicts=proceed

POST test/_search?filter_path=hits.total
{
  "query": {
    "match": {
      "flag": "foo"
    }
  }
}

The search now returns:

{
  "hits" : {
    "total": {
      "value": 1,
      "relation": "eq"
    }
  }
}

You can do the exact same thing when adding a field to a multifield.
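For example, a minimal sketch, assuming you add a hypothetical keyword sub-field named raw under the existing flag field:

PUT test/_mapping
{
  "properties": {
    "flag": {
      "type": "text",
      "analyzer": "keyword",
      "fields": {
        "raw": { "type": "keyword" }
      }
    }
  }
}
POST test/_update_by_query?refresh&conflicts=proceed

After the update by query completes, searches against flag.raw will match the reindexed documents.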


